Introduction and Setup


Source: media.giphy

Source: media.giphy


For the past several months, I have been working at a pizza place as a delivery driver. This dataset "pizza_data.csv" is data I have collected over the past few weeks. The data includes 9 different variables described below.

Variables

  1. Date: The date of the pizza delivery
  2. area: The section location of the delivery. Sections 1-16 and "MAIN" are designated by the store
  3. apartment_or_house: Designates whether the delivery was to a house or apartment
  4. street_name: name or address of pizza delivery location
  5. time: time that the order left the store
  6. total: total cost of pizza order, without the tip
  7. tip_method: How the customer tipped the driver, with three options of "pre_paid", "write_in" (customer signed receipt), and "cash_order"(customer did not pre-pay for entire order, instead paid with cash + tip at door)
  8. hours_worked: The number of hours worked in that shift for that day
  9. position_worked: The title of the position worked for that shift. Options include "Rush"(usually the shortest shift, starting anywhere from 4-6pm and ending anywhere from 6~9pm), "Late" (usually begins around 5-6pm, and ends around 9-11pm), and "Close" (usually begins around 5-6pm, ends around 1-2am)


There are 282 delivery observations over 20 shifts. I typically work rush or late shifts, but occasionally close. For this reason, there are more days that are "rush" or "late" than "close," (a total of 5 days for close).


setwd("/Users/emilyreed/Desktop")
pizza <- read.csv("pizza_data.csv")
glimpse(pizza)
## Rows: 282
## Columns: 10
## $ Date               <chr> "10/8/20", "10/8/20", "10/8/20", "10/8/20", "10/8/…
## $ area               <chr> "11", "MAIN", "5", "15", "12", "MAIN", "MAIN", "15…
## $ apartment_or_house <chr> "H", "A", "H", "A", "H", "A", "A", "A", "A", "A", …
## $ street_name        <chr> "cottonweed Trl", "350 Cypress Creek", "Tattler Dr…
## $ time               <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ total              <dbl> 41.06, 26.48, 21.00, 14.05, 23.78, 25.95, 21.08, 1…
## $ tip                <chr> "0", "5.3", "3.15", "5", "2.38", "3.89", "3.92", "…
## $ tip_method         <chr> "write in", "pre-tip", "pre-tip", "pre-tip", "pre-…
## $ hours_worked       <dbl> 4.75, 4.75, 4.75, 4.75, 4.75, 4.75, 4.75, 4.75, 4.…
## $ position_worked    <chr> "Late", "Late", "Late", "Late", "Late", "Late", "L…


I wanted to look at tips as a percentage rather than just an amount, so I created a new variable called "tip_perc" to see the percentage of tip compared to the whole total.

pizza <- pizza %>% mutate(tip = as.numeric(tip)) %>% 
    mutate(tip_perc = ((tip * 100)/total)) %>% filter(tip_perc != 
    Inf)


I also wanted to see how much money I made in tips per day as a rate per hour:

# first making rate a new variable
pizza2 <- pizza %>% group_by(Date) %>% filter(!is.na(tip)) %>% 
    filter(!is.na(total)) %>% mutate(total_tips = sum(tip)) %>% 
    mutate(rate = (total_tips/hours_worked))

# now displaying each rate in descending order
pizza %>% group_by(Date) %>% filter(!is.na(tip)) %>% 
    filter(!is.na(total)) %>% mutate(total_tips = sum(tip)) %>% 
    mutate(rate = (total_tips/hours_worked)) %>% summarize(rate = unique(rate)) %>% 
    arrange(desc(rate))
## # A tibble: 20 x 2
##    Date      rate
##    <chr>    <dbl>
##  1 10/22/20 17.7 
##  2 11/5/20  17.3 
##  3 11/1/20  17.3 
##  4 11/15/20 15.8 
##  5 10/16/20 15.1 
##  6 10/23/20 14.3 
##  7 10/29/20 12.7 
##  8 10/10/20 12.6 
##  9 10/9/20  12.4 
## 10 10/15/20 11.2 
## 11 10/30/20 10.9 
## 12 11/8/20  10.8 
## 13 10/11/20 10.6 
## 14 10/31/20 10.6 
## 15 10/18/20 10.5 
## 16 11/13/20 10.4 
## 17 11/14/20 10.1 
## 18 10/8/20   9.78
## 19 11/12/20  9.02
## 20 10/25/20  8.65
# how many deliveries did I take each day and how
# much money did I earn in tips?
pizza %>% group_by(Date) %>% filter(!is.na(tip)) %>% 
    summarize(deliveries = n(), tip_total = sum(tip)) %>% 
    arrange(desc(deliveries))
## # A tibble: 20 x 3
##    Date     deliveries tip_total
##    <chr>         <int>     <dbl>
##  1 10/9/20          28     112. 
##  2 11/14/20         25      88.2
##  3 10/31/20         21      90.1
##  4 11/12/20         21      81.2
##  5 11/13/20         20      88.1
##  6 10/22/20         17      83.8
##  7 10/23/20         17      75.2
##  8 10/16/20         14      64.1
##  9 10/8/20          14      46.5
## 10 10/15/20         13      53.0
## 11 10/30/20         13      49.2
## 12 11/8/20          13      53.8
## 13 11/1/20          11      51.8
## 14 11/15/20         10      55.2
## 15 11/5/20          10      43.2
## 16 10/10/20          9      37.9
## 17 10/11/20          9      32.0
## 18 10/29/20          7      38.1
## 19 10/18/20          5      18.3
## 20 10/25/20          3      10.8

MANOVA

Alright, for the MANOVA I wanted to see if rate of tips per hour and tip percentage differed significantly by the position I worked. Since all other numeric variables were used in the creation of tip_perc and rate, I decided to omit them and just focus on the two listed above.

  • Hypotheses:
    • H0: for each position worked (rush, late, close), the means for all groups (rate and tip_perc) are equal
    • Ha: for at least one position, at least 1 group (rate and/or tip_perc) mean differs.
# Running the MANOVA
pizza_no_NAs <- pizza2 %>% na.omit
man1 <- manova(cbind(rate, tip_perc) ~ position_worked, 
    data = pizza_no_NAs)
summary(man1)  #significant!!!!!!!
##                  Df  Pillai approx F num Df den Df    Pr(>F)    
## position_worked   2 0.34455   24.247      4    466 < 2.2e-16 ***
## Residuals       233                                             
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary.aov(man1)  #shows that rate is significant, not tip percentage with position worked
##  Response rate :
##                  Df  Sum Sq Mean Sq F value    Pr(>F)    
## position_worked   2  608.47 304.235  60.495 < 2.2e-16 ***
## Residuals       233 1171.78   5.029                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response tip_perc :
##                  Df Sum Sq Mean Sq F value Pr(>F)
## position_worked   2    282  141.24  0.3327 0.7173
## Residuals       233  98904  424.48
# which position?
pairwise.t.test(pizza_no_NAs$rate, pizza_no_NAs$position_worked, 
    p.adj = "none")
## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  pizza_no_NAs$rate and pizza_no_NAs$position_worked 
## 
##      Close  Late
## Late 4e-15  -   
## Rush <2e-16 0.27
## 
## P value adjustment method: none
# probability of making a type 1 error
Pr_type1error = 1 - 0.95^5
Pr_type1error
## [1] 0.2262191
# adjusted significance level, still significant
a = 0.05/5


Five statistical tests were performed. The probability of making a type 1 error was 22.6%, so that also must be taken in to consideration. To account for this, the adjusted significance level was calculated to be .01. Using this number, late and close are significantly different in rate per hour (p value<.01), as well as rush and close (p value<.01). However, rush and late are not significantly different from each other in rate or tip_perc (p value>.01).

Checking Assumptions

group <- pizza_no_NAs$position_worked
DVs <- pizza_no_NAs %>% ungroup() %>% select(rate, 
    tip_perc)

# Test multivariate normality for each group
sapply(split(DVs, group), mshapiro_test)
##           Close        Late         Rush        
## statistic 0.2754013    0.4345129    0.4965609   
## p.value   6.971036e-20 2.062022e-14 2.564588e-14
# check for mulitvariate normality via graph
ggplot(pizza_no_NAs, aes(x = rate, y = tip_perc)) + 
    geom_point(alpha = 0.5) + geom_density2d(aes(color = position_worked)) + 
    facet_wrap(~position_worked)

pizza_no_NAs <- pizza_no_NAs %>% relocate(tip_perc, 
    .after = total_tips)

# check for homogenity of covariances (for fun,
# even though multivariate normality is violated)
# Box's M test
box_m(DVs, group)
## # A tibble: 1 x 4
##   statistic  p.value parameter method                                           
##       <dbl>    <dbl>     <dbl> <chr>                                            
## 1      87.7 9.13e-17         6 Box's M-test for Homogeneity of Covariance Matri…
# View covariance matrices for each group
lapply(split(DVs, group), cov)
## $Close
##               rate    tip_perc
## rate     1.3344802   0.3587544
## tip_perc 0.3587544 589.4262426
## 
## $Late
##               rate   tip_perc
## rate      8.610311  -1.372266
## tip_perc -1.372266 369.783763
## 
## $Rush
##               rate   tip_perc
## rate      6.900647  -3.266294
## tip_perc -3.266294 246.119125

MANOVA Assumptions:

  • Random Samples: All of our deliveries are randomly assigned to each driver.
  • Multivariate Normality: P values of multivariate test are p<.05, meaning the null assumption that normality is met is rejected. Graphs also indicate normality is violated.
  • Homogeneity of within-group covariance matrices: P-value<.05 for mox M's test, meaning homogenity assumption is violated. The matrices look very different
  • Linear relationships among DVs: nope. higher tip_perc for one delivery does not mean overall rate for the day will be higher
  • No extreme univariate or multivariate outliers: there are some large outliers in tip_perc
  • No multicollinearity (i.e., DVs should not be too correlated): rate and tip_perc should be different enough


Overall, the data does not meet the assumptions for a valid MANOVA to be possible.



Randomization Test


I wanted to see if housing type (aka apartment or house) was significantly related to tip percentage.

  • Hypotheses:
    • H0: housing type (apartment of house) is not significantly related to tip percentage
    • Ha: housing type (apartment of house) is significantly related with tip percentage
# Finding average percentaged tipped in apartments
# and houses
pizza_no_NAs %>% group_by(apartment_or_house) %>% summarize(means = mean(tip_perc))
## # A tibble: 2 x 2
##   apartment_or_house means
##   <chr>              <dbl>
## 1 A                   16.3
## 2 H                   20.2
# scaling tip perc
pizza_scale <- pizza_no_NAs %>% mutate_if(is.numeric, 
    scale)

# Finding the observed difference in means
pizza_scale %>% group_by(apartment_or_house) %>% summarize(means = mean(tip_perc)) %>% 
    summarize(mean_diff = diff(means))
## # A tibble: 1 x 1
##   mean_diff
##       <dbl>
## 1     0.161


On average, 'houses' tipped 20.24% of the total price, while apartments tipped 16.25% on average. Houses tipped ~4% more than apartments, based on total price of the order. Using the scaled data, the mean difference between housing type was .1614303, which will be used to calculate the two tailed p-value.


# Randomization test. Scrambling houses and
# apartments and their tip percentages

set.seed(12)
rand_dist <- vector()

for (i in 1:5000) {
    new <- data.frame(tip_perc = sample(pizza_scale$tip_perc), 
        apartment_or_house = pizza_no_NAs$apartment_or_house)
    rand_dist[i] <- mean(new[new$apartment_or_house == 
        "H", ]$tip_perc) - mean(new[new$apartment_or_house == 
        "A", ]$tip_perc)
}


# finding the two tailed p value:
mean(rand_dist > 0.1614303 | rand_dist < -0.1614303)  #fail to reject H0, there is no significant difference between apartment and house tip percentages
## [1] 0.2302

The two tailed p-value=.2302 (pvalue>.05), so we fail to reject the null hypothesis. There is no significant difference in tip percentage between houses and apartments based off their total order amount.

Generating histograms


# histogram for observed tip percentages for
# apartment vs houses
plot1 <- ggplot(pizza_scale, aes(x = tip_perc, fill = apartment_or_house)) + 
    geom_histogram(bins = 7) + facet_wrap(~apartment_or_house) + 
    ggtitle("Observed Values")

# histogram for random tip percentages for
# apartment vs houses
plot2 <- ggplot(new, aes(x = tip_perc, fill = apartment_or_house)) + 
    geom_histogram(bins = 7) + facet_wrap(~apartment_or_house) + 
    ggtitle("Randomization Test")

grid.arrange(plot1, plot2, ncol = 2, top = "Observed vs Randomization tip percentages")

# visualize test stat

{
    hist(rand_dist, main = "", ylab = "")
    abline(v = c(-0.1614303, 0.1614303), col = "red")
}

The histogram depicts the distance of means in the random data set (created by scrambling the original data), as well as the original observed difference in means (red vertical lines). Most of the random data is within the original data observed difference of means. There is lots of data that falls beyond the observed difference (reflected in our large calculated p-value of .2302).


Linear Regression


I wanted to see if I could predict tip percentage from area and time. Due to the fact that both 'area' and 'time' had so many values, I created new variables to hopefully succinctly describe the data. For 'area' I created a new variable called "new_area_class" based on the average cost of the houses/apartments, which has four variables (instead of 17) of "apartments", "lower", "middle", and "higher". I also created a new variable from 'time' called "time_group", which is chunked by two-hour intervals. The three possible values for time_group are "early rush" (4-6pm), "late_rush" (6-8pm), and "night"(8pm-1am).

*note: I will not be centering tip_percentage because it is possible to receive a 0% tip, and therefore an understandable intercept.

  • Hypotheses:
    • H01: controlling for area group, time doesn't explain variation in percent tip
    • H02 controlling for time, area group does not explain in tip percent
# creating new variable of new area class to group
# together average housing costs
pizza_new_area <- pizza2 %>% filter(!is.na(area)) %>% 
    filter(!is.na(tip_perc)) %>% mutate(new_area_class = recode(area, 
    `13` = "lower", `14` = "lower", `1` = "lower", 
    MAIN = "apartments", `15` = "apartments", `4` = "middle", 
    `11` = "middle", `2` = "middle", `9` = "middle", 
    `12` = "middle", `10` = "middle", `6` = "higher", 
    `7` = "higher", `5` = "higher", `8` = "higher", 
    `3` = "higher", `16` = "higher"))

# see how many observations are in each area
pizza_new_area %>% group_by(new_area_class) %>% summarize(n())
## # A tibble: 4 x 2
##   new_area_class `n()`
##   <chr>          <int>
## 1 apartments        75
## 2 higher            67
## 3 lower             39
## 4 middle            81
# average tip percentage based off of new area
# classes
pizza_new_area %>% group_by(new_area_class) %>% summarize(mean(tip_perc))
## # A tibble: 4 x 2
##   new_area_class `mean(tip_perc)`
##   <chr>                     <dbl>
## 1 apartments                 16.6
## 2 higher                     20.6
## 3 lower                      17.3
## 4 middle                     20.5
# creating new variable of time_group
pizza_new_area <- pizza_new_area %>% filter(!is.na(time)) %>% 
    mutate(time2 = str_replace(time, pattern = ":", 
        replacement = ".")) %>% mutate(time2 = as.numeric(time2)) %>% 
    mutate(time_group = (case_when(time2 > 4 & time2 <= 
        6 ~ "early_rush", time2 > 6 & time2 <= 8 ~ 
        "late_rush", time2 > 8 ~ "night")))


# regression for tip_perc for tip percentage based
# on area groups and time of delivery
pizza_new_area <- pizza_new_area %>% filter(tip_perc != 
    Inf)
fit <- lm(tip_perc ~ new_area_class * time_group, data = pizza_new_area)
summary(fit)
## 
## Call:
## lm(formula = tip_perc ~ new_area_class * time_group, data = pizza_new_area)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.422  -6.848  -1.952   2.958 217.492 
## 
## Coefficients:
##                                          Estimate Std. Error t value Pr(>|t|)  
## (Intercept)                               15.6662     7.2813   2.152   0.0325 *
## new_area_classhigher                       0.7590     8.6154   0.088   0.9299  
## new_area_classlower                        1.4948    11.1224   0.134   0.8932  
## new_area_classmiddle                       2.1048     8.3656   0.252   0.8016  
## time_grouplate_rush                        0.2507     8.3265   0.030   0.9760  
## time_groupnight                            1.2733     8.1949   0.155   0.8767  
## new_area_classhigher:time_grouplate_rush   9.6635    10.4024   0.929   0.3539  
## new_area_classlower:time_grouplate_rush   -0.4676    12.9046  -0.036   0.9711  
## new_area_classmiddle:time_grouplate_rush   1.5285     9.9204   0.154   0.8777  
## new_area_classhigher:time_groupnight      -0.6643    10.5795  -0.063   0.9500  
## new_area_classlower:time_groupnight       -4.8455    13.1602  -0.368   0.7131  
## new_area_classmiddle:time_groupnight      10.3775    10.5181   0.987   0.3249  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.59 on 224 degrees of freedom
## Multiple R-squared:  0.04214,    Adjusted R-squared:  -0.004903 
## F-statistic: 0.8958 on 11 and 224 DF,  p-value: 0.5453


Looking at the results from the linear regression, there does not seem to be any areas or time groups that are significantly different in tip percentage. The intercept indicates that "apartment areas" during early rush tip an average of 15.6662%. Analyzing the coefficent of new_area_classhigher, controlling for time group, the higher class tips an average .76% higher more compared to the "apartment area". What is very interesting is that, when controlling for time, the 'lower' and 'middle' class area tips higher on average than than 'higher' class area does. However, when time is a factor, tip percentage in lower class areas is shown to decrease in the night group and the rush group.


Plot the regression

# plotting the regression
pizza_new_area %>% ggplot(aes(new_area_class, tip_perc, 
    color = time_group)) + geom_point() + geom_smooth(method = "lm", 
    se = F) + geom_jitter()

Checking assumptions

# checking for linearity
plot(fit, 1)  #line is essentially horizontal, indicating linearity, not overall great

# checking for normality
resids <- fit$residuals
fitvals <- fit$fitted.values
par(mfrow = c(1, 2))
hist(resids)
qqnorm(resids)
qqline(resids, col = "red")  #looks mostly normal, but a few outliers (high tip_perc) are disturbing the normality

# checking for homoskedasticity
bptest(fit)  #pvalue>.05, so fail to reject the null hypothesis, homoskedasticity is met
## 
##  studentized Breusch-Pagan test
## 
## data:  fit
## BP = 14.852, df = 11, p-value = 0.1894
# using robust standard errors on original fit
# object:
coeftest(fit, vcov = vcovHC(fit))
## 
## t test of coefficients:
## 
##                                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)                              15.66624    1.20838 12.9647   <2e-16
## new_area_classhigher                      0.75896    1.84266  0.4119   0.6808
## new_area_classlower                       1.49478    2.03328  0.7352   0.4630
## new_area_classmiddle                      2.10483    2.83542  0.7423   0.4587
## time_grouplate_rush                       0.25071    1.84168  0.1361   0.8918
## time_groupnight                           1.27332    2.63543  0.4832   0.6295
## new_area_classhigher:time_grouplate_rush  9.66354    7.51776  1.2854   0.2000
## new_area_classlower:time_grouplate_rush  -0.46763    2.66537 -0.1754   0.8609
## new_area_classmiddle:time_grouplate_rush  1.52846    3.67315  0.4161   0.6777
## new_area_classhigher:time_groupnight     -0.66433    3.61059 -0.1840   0.8542
## new_area_classlower:time_groupnight      -4.84550    3.65787 -1.3247   0.1866
## new_area_classmiddle:time_groupnight     10.37754   15.57269  0.6664   0.5058
##                                             
## (Intercept)                              ***
## new_area_classhigher                        
## new_area_classlower                         
## new_area_classmiddle                        
## time_grouplate_rush                         
## time_groupnight                             
## new_area_classhigher:time_grouplate_rush    
## new_area_classlower:time_grouplate_rush     
## new_area_classmiddle:time_grouplate_rush    
## new_area_classhigher:time_groupnight        
## new_area_classlower:time_groupnight         
## new_area_classmiddle:time_groupnight        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# how much variation is explained in the model?
summary(fit)$r.sq  #this is a horrible model oh my
## [1] 0.04213543


Even after using robust stanndard errors on the data, there is nothing too significant about a certain time group or area class. The coefficents stay the same. The model does an abysmal job of explaining the proportion of variation in the dataset, only about 4.2% is explained using Multiple R squared (ouch)


Bootstrapped Standard Errors


# sampling rows with replacement
boot_dat <- sample_frac(pizza_new_area, replace = T)

samp_distn <- replicate(5000, {
    boot_dat <- sample_frac(pizza_new_area, replace = T)
    fit <- lm(tip_perc ~ new_area_class * time_group, 
        data = boot_dat)
    coef(fit)
})

## Estimated/boostrap SEs
samp_distn %>% t %>% as.data.frame %>% summarize_all(sd)
##   (Intercept) new_area_classhigher new_area_classlower new_area_classmiddle
## 1    1.084183             1.715218            1.826959             2.548431
##   time_grouplate_rush time_groupnight new_area_classhigher:time_grouplate_rush
## 1            1.643638        2.479105                                 7.040127
##   new_area_classlower:time_grouplate_rush
## 1                                2.412218
##   new_area_classmiddle:time_grouplate_rush new_area_classhigher:time_groupnight
## 1                                 3.215639                             3.440425
##   new_area_classlower:time_groupnight new_area_classmiddle:time_groupnight
## 1                                  NA                             14.65611
# CIs
samp_distn %>% t %>% as.data.frame %>% na.omit() %>% 
    pivot_longer(1:12) %>% group_by(name) %>% summarize(lower = quantile(value, 
    0.025), upper = quantile(value, 0.975))
## # A tibble: 12 x 3
##    name                                      lower upper
##    <chr>                                     <dbl> <dbl>
##  1 (Intercept)                               13.5  17.8 
##  2 new_area_classhigher                      -2.59  4.12
##  3 new_area_classhigher:time_grouplate_rush  -1.98 24.7 
##  4 new_area_classhigher:time_groupnight      -7.75  5.93
##  5 new_area_classlower                       -1.75  5.34
##  6 new_area_classlower:time_grouplate_rush   -5.19  4.22
##  7 new_area_classlower:time_groupnight      -11.9   1.31
##  8 new_area_classmiddle                      -2.27  7.62
##  9 new_area_classmiddle:time_grouplate_rush  -5.14  7.50
## 10 new_area_classmiddle:time_groupnight     -10.1  44.0 
## 11 time_grouplate_rush                       -2.98  3.50
## 12 time_groupnight                           -3.04  6.57


Using bootstrapped standard errrors and a 95% CI greatly changed the intercept compared to using robust standard errors. Using this technique, the average tip percentage an "apartment" area during early rush is 1.09 standard deviations away from the mean, which the average percentage for this group is between 13.38% and 17.66% (using a 95% confidence interval). The higher class area during early rush tips anywhere between 10.837% and 21.79% using a 95% CI. The widest range group and time for tip_perc is in the area "middle" and time "night", with a 95% CI between 3.11% and 61.09%.


Logistic Regression Model #1


Similar concept here. I wanted to see if I could predict housing type from percentage tipped and time (hour:minute, rather than time group) of delivery

# predict house type from perc_tip and time? OR
# predict tip method from perc_tip and time
pizza_fit <- pizza_new_area %>% mutate(y = ifelse(apartment_or_house == 
    "H", 1, 0)) %>% ungroup()

# finding average time of deliveries
pizza_fit %>% summarize(mean(time2))
## # A tibble: 1 x 1
##   `mean(time2)`
##           <dbl>
## 1          7.18
# mean center time, since time is never 0
pizza_fit$time2 <- pizza_fit$time2 - mean(pizza_fit$time2)


# running logistic regression
fit3 <- glm(y ~ tip_perc + time2, data = pizza_fit, 
    family = binomial(link = "logit"))

coeftest(fit3)
## 
## z test of coefficients:
## 
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.372773   0.247836  1.5041 0.132551   
## tip_perc     0.015658   0.012000  1.3048 0.191957   
## time2       -0.231818   0.075707 -3.0620 0.002198 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef(fit3))
## (Intercept)    tip_perc       time2 
##   1.4517553   1.0157814   0.7930905
Pr_of_house_mean_time = 1.4517/(1 + 1.4517)

# average tip percentage from houses vs apartment
pizza_new_area %>% group_by(apartment_or_house) %>% 
    summarize(mean(tip_perc))
## # A tibble: 2 x 2
##   apartment_or_house `mean(tip_perc)`
##   <chr>                         <dbl>
## 1 A                              16.3
## 2 H                              20.2


Predicted odds of delivering to a House when mean time of deliveries (7:18 pm) and tip percentage is zero is 1.4517, or a probability of 59.21%. Controlling for time, for every 1 unit increase in tip percentage, odds of delivering to a House go up by a factor of 1.0157. For every 1 minute increase in time, the odds of delivering to a house increase by a factor of .7930.

# confusion matrix
pizza_fit$prob <- predict(fit3, type = "response")
pizza_fit$predicted <- ifelse(pizza_fit$prob > 0.5, 
    "House", "Apartment")
table(truth = pizza_fit$apartment_or_house, prediction = pizza_fit$predicted) %>% 
    addmargins
##      prediction
## truth Apartment House Sum
##   A          15    67  82
##   H          10   144 154
##   Sum        25   211 236
# for (Accuracy, Sensitivity, Specificity,
# Precision, AUC)
class_diag <- function(probs, truth) {
    
    if (is.numeric(truth) == FALSE & is.logical(truth) == 
        FALSE) 
        truth <- as.numeric(truth) - 1
    
    tab <- table(factor(probs > 0.5, levels = c("FALSE", 
        "TRUE")), truth)
    prediction <- ifelse(probs > 0.5, 1, 0)
    acc = mean(truth == prediction)
    sens = mean(prediction[truth == 1] == 1)
    spec = mean(prediction[truth == 0] == 0)
    ppv = mean(truth[prediction == 1] == 1)
    f1 = 2 * (sens * ppv)/(sens + ppv)
    
    # CALCULATE EXACT AUC
    ord <- order(probs, decreasing = TRUE)
    probs <- probs[ord]
    truth <- truth[ord]
    
    TPR = cumsum(truth)/max(1, sum(truth))
    FPR = cumsum(!truth)/max(1, sum(!truth))
    
    dup <- c(probs[-1] >= probs[-length(probs)], FALSE)
    TPR <- c(0, TPR[!dup], 1)
    FPR <- c(0, FPR[!dup], 1)
    
    n <- length(TPR)
    auc <- sum(((TPR[-1] + TPR[-n])/2) * (FPR[-1] - 
        FPR[-n]))
    
    data.frame(acc, sens, spec, ppv, auc)
}

# finding accuracy, sensitivity (TPR), specificity
# (TNR), precision
class_diag(pizza_fit$prob, pizza_fit$y)
##         acc      sens      spec       ppv       auc
## 1 0.6737288 0.9350649 0.1829268 0.6824645 0.6256731
# density plot of the log-odds (logit)
pizza_fit$logit <- predict(fit3)
pizza_fit %>% ggplot(aes(logit, fill = apartment_or_house)) + 
    geom_density(alpha = 0.3) + geom_vline(xintercept = 0, 
    lty = 2)

# ROC plot
ROCplot <- ggplot(pizza_fit) + geom_roc(aes(d = y, 
    m = prob), n.cuts = 0)
ROCplot

# AUC width
calc_auc(ROCplot)
##   PANEL group       AUC
## 1     1    -1 0.6256731


The accuracy of the model predicting housing type from tip percentage from time is 67.37%. The sensitivity, or the true positive rate of the model is pretty good, at 93.5%. However, the specificity, or the true negative rate, is not so great, at 18.29%.The precision is 68.24%. The ROC plot indicates our model is not very good at distinguishing false positives from true positives, with an AUC of 62.56%. Overall, this model is not great at determining housing type from tip percentage and time of day. The density plot shows a huge amount of overlap between the predicted log odds for houses and apartments. This kinda makes sense. While I usually notice a trend of tip percentage being better around early rush, sometimes some really drunk people late at night in apartments will tip nicely. Sometimes, people who live in huge houses in nice neighborhoods will tip poorly during our busiest time of the day. It's all about the love of the game, delivering pizzas, rather than counting on rich people to be frivolous with their money and less fortunate people to be stingy.


Logistic Regression Model #2


This time, I re-ran the logistic regression model with all variables in the data set (that did not overlap in information). Then, after performing Lasso on the regression, I chose the best (most predictive) variables, then Cross Validated the model to make sure the AUC did not indicate overfitting.

# lg for all variables that don't overlap each
# other
pizza2 <- pizza2 %>% na.omit() %>% mutate(y = ifelse(apartment_or_house == 
    "H", 1, 0)) %>% select(-apartment_or_house)
fit <- glm(y ~ area + tip_perc + rate + tip_method + 
    hours_worked + position_worked, data = pizza2, 
    family = "binomial")

probs1 <- predict(fit, type = "response")
class_diag(probs1, pizza2$y)
##         acc     sens     spec       ppv       auc
## 1 0.9491525 0.974026 0.902439 0.9493671 0.9916455
# CV time
k = 10

data <- pizza2 %>% sample_frac  #put rows of dataset in random order
folds <- ntile(1:nrow(data), n = 10)  #create fold labels

diags <- NULL
for (i in 1:k) {
    train <- data[folds != i, ]  #create training set (all but fold i)
    test <- data[folds == i, ]  #create test set (just fold i)
    truth <- test$y
    
    fit <- glm(y ~ area + tip_perc + rate + tip_method + 
        hours_worked + position_worked, data = train, 
        family = "binomial")
    probs <- predict(fit, newdata = test, type = "response")
    
    diags <- rbind(diags, class_diag(probs, truth))
}

summarize_all(diags, mean)  #LOTS of overfitting
##         acc      sens      spec       ppv       auc
## 1 0.9106884 0.9255634 0.8734488 0.9367489 0.9708016


Using all the variables to predict housing type, the accuracy was astonishingly high, at 94.91%, along with a great TPR of 97.4%, TNR of 90.24% and an AUC of 99.16%. After running a cross-validation to see how well the data can predict housing type on new variables, the AUC dropped dramatically to 69.66%, as expected, which indicates overfitting.

# lasso
library(glmnet)
y <- as.matrix(pizza2$y)
x <- model.matrix(y ~ area + tip_perc + rate + tip_method + 
    hours_worked + position_worked, data = pizza2)[, 
    -1]

cv <- cv.glmnet(x, y, family = "binomial")

cv <- cv.glmnet(x, y, family = "binomial")
lasso <- glmnet(x, y, family = "binomial", lambda = cv$lambda.1se)
coef(lasso)
## 23 x 1 sparse Matrix of class "dgCMatrix"
##                            s0
## (Intercept)          2.583601
## area10               .       
## area11               .       
## area12               .       
## area13               .       
## area14              -2.562267
## area15              -4.681729
## area16              -2.100138
## area2                .       
## area3                .       
## area4                .       
## area5                .       
## area6                .       
## area7                .       
## area8                .       
## area9                .       
## areaMAIN            -5.468004
## tip_perc             .       
## rate                 .       
## tip_methodwrite in   .       
## hours_worked         .       
## position_workedLate  .       
## position_workedRush  .
# cross-validating lasso model
set.seed(1234)
k = 10

data <- pizza2 %>% sample_frac
folds <- ntile(1:nrow(data), n = 10)

diags <- NULL
for (i in 1:k) {
    train <- data[folds != i, ]
    test <- data[folds == i, ]
    truth <- test$y
    
    fit <- glm(y ~ area, data = train, family = "binomial")
    probs <- predict(fit, newdata = test, type = "response")
    
    diags <- rbind(diags, class_diag(probs, truth))
}

diags %>% summarize_all(mean)
##         acc     sens  spec       ppv       auc
## 1 0.9190217 0.929325 0.875 0.9530369 0.9785384


A lasso was performed to indicate which variables are the best at predicting housing type,in which only areas 14,15,16,7, and MAIN were needed to predict housing type. For clarity purposes, I used the whole "area" variable, rather than parsing out the areas into sections. Re-running a CV on the fit model built using the lasso output, the AUC increased to 97.8%, close to the original fit AUC of 99.16%. All other in sample classification diagnostics between original fit and lasso fit were almost identical.


I have to admit that this was very disappointing. I was hoping to be able to predict housing type like a pizza wizard off of something more unrelated than area. Most areas are broken into houses vs apartments anyway, so the result is not very suprising. Oh well.


Random Stats


Which apartment tips the best?

#visualizing by stats which apartment tips best percentage-wise

#fixing pizza2 to have apartment or house variable again:
pizza %>% 
  group_by(Date) %>% 
  filter(!is.na(tip)) %>% 
  filter(!is.na(total)) %>% 
  mutate(total_tips=sum(tip)) %>% 
  mutate(rate=(total_tips/hours_worked))->pizza2

pizza2 %>% 
  filter(apartment_or_house=="A") %>% 
  group_by(street_name) %>% 
  filter(!tip_perc>60) %>% #getting rid of the outlier
  summarize(avg_tip_perc=mean(tip_perc)) %>%
  arrange(desc(avg_tip_perc))
## # A tibble: 28 x 2
##    street_name            avg_tip_perc
##    <chr>                         <dbl>
##  1 "Altis "                       20.6
##  2 "Grand Oaks Loop"              20.0
##  3 "2930 Grand Oaks Loop"         19.5
##  4 "11908 Anderson Mill "         18.9
##  5 "Altis"                        17.7
##  6 "1900 Little Elm Trl"          17.6
##  7 "2201 Lakeline "               17.2
##  8 "2850 Lakeline"                17.0
##  9 "2801 Lakeline"                16.7
## 10 "2304 Lakeline"                16.7
## # … with 18 more rows
pizza2 %>%  
  filter(apartment_or_house=="A") %>% 
  group_by(street_name) %>% 
  filter(!tip_perc>60) %>% #getting rid of the outlier
  mutate(avg_tip_perc=mean(tip_perc)) %>% 
  ggplot()+
  geom_jitter(aes(x=1:109,y=tip_perc))+
  geom_jitter(aes(x=20, y=avg_tip_perc), color="red", shape="square")+
  facet_wrap(~street_name)

Correlation heat map of all variables

# correlation for all numeric variables
pizza_new_area <- pizza_new_area %>% select(-time2)
ggcorr(pizza_new_area, label = TRUE)

# for write-ins at apartments, whats the mean
# tip_perc?
pizza2 %>% filter(apartment_or_house == "A") %>% filter(tip_method == 
    "write in") %>% group_by(street_name) %>% summarize(mean = mean(tip_perc)) %>% 
    arrange(desc(mean))
## # A tibble: 10 x 2
##    street_name             mean
##    <chr>                  <dbl>
##  1 "350 Cypress Creek"     22.6
##  2 "11908 Anderson Mill "  21.1
##  3 "2000 Lakeline"         20.7
##  4 "Altis "                20.6
##  5 "2930 Grand Oaks Loop"  19.5
##  6 "Altis"                 14.0
##  7 "13010 Ridgeline Blvd"  12.3
##  8 "12638 Ridgeline Blvd"  11.2
##  9 "2829 Lakeline"          0  
## 10 "3405 El Salido Pkwy"    0
# for cash orders at apartments, what's the mean
# tip_perc?
pizza2 %>% filter(apartment_or_house == "A") %>% filter(tip_method == 
    "cash order") %>% group_by(street_name) %>% summarize(mean = mean(tip_perc)) %>% 
    arrange(desc(mean))
## # A tibble: 4 x 2
##   street_name      mean
##   <chr>           <dbl>
## 1 32018 El Salido 18.6 
## 2 2850 Lakeline   17.8 
## 3 2000 Lakeline   17.0 
## 4 1600 Lakeline    7.22
# for all cash orders, whats the mean tip_perc for
# each day?
pizza2 %>% filter(tip_method == "cash order") %>% summarize(mean = mean(tip_perc)) %>% 
    arrange(desc(mean))
## # A tibble: 9 x 2
##   Date       mean
##   <chr>     <dbl>
## 1 10/15/20 54.3  
## 2 10/23/20 37.0  
## 3 11/15/20 25.4  
## 4 10/31/20 23.4  
## 5 10/8/20  18.6  
## 6 11/1/20  17.8  
## 7 10/9/20  17.7  
## 8 10/18/20 12.1  
## 9 11/12/20  0.806